Sequential conditional generalized iterative scaling

ABSTRACT

A system and method facilitating training machine learning systems utilizing sequential conditional generalized iterative scaling is provided. The invention includes an expected value update component that modifies an expected value based, at least in part, upon a feature function of an input vector and an output value, a sum of lambda variable and a normalization variable. The invention further includes an error calculator that calculates an error based, at least in part, upon the expected value and an observed value. The invention also includes a parameter update component that modifies a trainable parameter based, at least in part, upon the error. A variable update component that updates at least one of the sum of lambda variable and the normalization variable based, at least in part, upon the error is also provided.

TECHNICAL FIELD

[0001] The present invention relates generally to machine learning, and more particularly to a system and method employing sequential conditional generalized iterative scaling for training machine learning system(s), especially systems using so-called maximum entropy models, logistic regression, or perceptrons trained to minimize entropy.

BACKGROUND OF THE INVENTION

[0002] Machine learning is a general term that describes automatically setting the parameters of a system so that the system operates better. One common use for machine learning is the training of parameters for a system that predicts the behavior of objects or the relationship between objects. An example of such a system is a language model used to predict the likelihood of a sequence of words in a language.

[0003] One problem with current machine learning is that it can require a great deal of time to train a single system. In particular, systems that utilize Maximum Entropy techniques to describe the probability of some event tend to have long training times, especially if the number of different features that the system uses is large.

[0004] Conditional Maximum Entropy models have been used for a variety of natural language tasks, including Language Modeling, part-of-speech tagging, prepositional phrase attachment, parsing, word selection for machine translation, and finding sentence boundaries. Unfortunately, although maximum entropy (maxent) models can be applied very generally, the conventional training algorithm for maxent, Generalized Iterative Scaling (GIS), can be extremely slow.

Discussion of Generalized Iterative Scaling

[0005] Conditional maxent models are of the form:

$P(y \mid \bar{x}) = \frac{\exp\left(\sum_i \lambda_i f_i(\bar{x}, y)\right)}{\sum_{y'} \exp\left(\sum_i \lambda_i f_i(\bar{x}, y')\right)}$  (1)

[0006] where x̄ is an input vector, y is an output, the f_i are feature functions (indicator functions) that are true if a particular property of x̄, y is true, and λ_i is a trainable parameter (e.g., weight) for the feature function f_i. For example, if trying to do word sense disambiguation for the word “bank”, x̄ would be the context around an occurrence of the word; y would be a particular sense, e.g., financial or river; f_i(x̄, y) could be 1 if the context includes the word “money” and y is the financial sense; and λ_i would be a large positive number.
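By way of a purely illustrative sketch, the following Python fragment evaluates equation (1) for a toy version of the “bank” example; the single feature, its weight, and the context set are hypothetical choices, not values prescribed by the model.

    import math

    def maxent_probability(features, lambdas, x, outputs):
        # score(y) = sum_i lambda_i * f_i(x, y); P(y | x) normalizes over all outputs y'.
        scores = {y: sum(lam * f(x, y) for f, lam in zip(features, lambdas)) for y in outputs}
        z = sum(math.exp(s) for s in scores.values())
        return {y: math.exp(s) / z for y, s in scores.items()}

    # f_0 is 1 when the context contains "money" and the candidate sense is financial.
    f_0 = lambda x, y: 1 if ("money" in x and y == "financial") else 0
    print(maxent_probability([f_0], [2.0], {"money", "deposit"}, ["financial", "river"]))
    # The financial sense receives probability e^2 / (e^2 + 1), roughly 0.88.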

[0007] Maxent models have several valuable properties, one of which is constraint satisfaction. For a given f_i, the number of times f_i was observed in the training data can be determined ($\mathrm{observed}[i] = \sum_j f_i(\bar{x}_j, y_j)$).

[0008] For a model $P_{\bar{\lambda}}$ with parameters $\bar{\lambda}$, the number of times the model predicts f_i can be determined ($\mathrm{expected}[i] = \sum_{j,y} P_{\bar{\lambda}}(y \mid \bar{x}_j)\, f_i(\bar{x}_j, y)$).
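As a minimal sketch of these two counts, assuming a hypothetical two-instance training set and a deliberately crude uniform stand-in for the model:

    def observed_count(f_i, data):
        # observed[i] = sum over labelled pairs (x_j, y_j) of f_i(x_j, y_j)
        return sum(f_i(x, y) for x, y in data)

    def expected_count(f_i, p_y_given_x, data, outputs):
        # expected[i] = sum over instances j and outputs y of P(y | x_j) * f_i(x_j, y)
        return sum(p_y_given_x(y, x) * f_i(x, y) for x, _ in data for y in outputs)

    f_0 = lambda x, y: 1 if ("money" in x and y == "financial") else 0
    data = [({"money"}, "financial"), ({"river"}, "river")]
    uniform = lambda y, x: 0.5                       # stand-in model P(y | x)
    print(observed_count(f_0, data), expected_count(f_0, uniform, data, ["financial", "river"]))
    # Prints 1 and 0.5: the constraint observed[i] = expected[i] is not yet satisfied.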

[0009] Maxent models have the property that expected[i] = observed[i] for all i. These equalities are called constraints. An additional property of models in the form of Equation (1) is that the maxent model maximizes the probability of the training data. Yet another property is that maxent models are as close as possible to the uniform distribution, subject to constraint satisfaction.

[0010] Maximum entropy models are conventionally learned using generalized iterative scaling (GIS). At each iteration, a step is taken in a direction that increases the likelihood of the training data. The step size is determined to be not too large and not too small: the likelihood of the training data increases at each iteration and eventually converges to the global optimum. Unfortunately, this comes at a price: GIS takes a step size inversely proportional to the maximum number of active constraints. Maxent models are interesting precisely because of their ability to combine many different kinds of information, so this weakness of GIS means that maxent models are slow to learn precisely when they are most useful.

[0011] Those skilled in the art will recognize that systems using values such as $\sum_i \lambda_i f_i(\bar{x}, y)$ and systems using $\mu_i = e^{\lambda_i}$ with $\prod_i \mu_i^{f_i(\bar{x}, y)} = e^{\sum_i \lambda_i f_i(\bar{x}, y)}$

[0012] are equivalent systems; the change from sums of λ values to products of μ values is essentially only a notational change.
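A short numerical check of this equivalence, using arbitrary weights and binary feature values, may make the notational point concrete:

    import math

    lambdas, f = [0.5, -1.2, 2.0], [1, 0, 1]          # arbitrary weights and feature values
    mus = [math.exp(lam) for lam in lambdas]          # mu_i = e^(lambda_i)
    product_form = math.prod(mu ** fi for mu, fi in zip(mus, f))
    sum_form = math.exp(sum(lam * fi for lam, fi in zip(lambdas, f)))
    assert abs(product_form - sum_form) < 1e-9        # the two notations give the same value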

SUMMARY OF THE INVENTION

[0013] The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. This summary is not an extensive overview of the invention. It is not intended to identify key/critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later.

[0014] The present invention provides for a system for training a machine learning system that can be used, for example, for a variety of natural language tasks, including Language Modeling, part-of-speech tagging, prepositional phrase attachment, parsing, word selection for machine translation, and finding sentence boundaries. The system is based on employing sequential conditional generalized iterative scaling to train the machine learning system.

[0015] In accordance with an aspect of the present invention, the system includes an expected value update component, an error calculator, a parameter update component, a variable update component, a sum of lambda variable store, and a normalization variable store.

[0016] The expected value update component can modify an expected value for a plurality of outputs and for a plurality of instances in which a feature function is non-zero, based, at least in part, upon the feature function of an input vector and an output value, a sum of lambda variable and a normalization variable. The error calculator can calculate an error based, at least in part, upon the expected value and an observed value. The parameter update component can modify a trainable parameter based, at least in part, upon the error. The variable update component can update the sum of lambda variable and/or the normalization variable for a plurality of outputs and for a plurality of instances in which a feature function is non-zero, based, at least in part, upon the error. The sum of lambda variable store can store the sum of lambda variables and the normalization variable store can store the normalization variables. The system can sequentially update trainable parameters, for example, for each feature function until the trainable parameters have converged.

[0017] Conventionally, maximum entropy (maxent) models are trained using generalized iterative scaling (GIS). At each iteration, a step is taken in a direction that increases the likelihood of the training data. The step size is determined to be not too large and not too small: the likelihood of the training data increases at each iteration and eventually converges to the global optimum.

[0018] The system of the present invention employs sequential conditional generalized iterative scaling (SCGIS) to train the machine learning system. Thus, rather than learning substantially all trainable parameters of the model simultaneously, the system learns them sequentially: one, then the next, and so on, and then back to the beginning. The system can cache subcomputations (e.g., sum of lambda variable and/or normalization variable), for example, to increase speed of the system.

[0019] Conventional GIS-based algorithms employ training data stored as a sparse matrix of feature functions with non-zero values for each instance. In accordance with an aspect of the present invention, the sequential conditional generalized iterative scaling of the present invention employs training data stored as a sparse matrix of instances with non-zero values for each feature function.

[0020] Yet another aspect of the present invention provides for the system to further include a training data store and a parameter store. The training data store stores input vector(s) and/or the observed value(s). In one example, information is stored in the training data store so as to facilitate efficient transfer of information within the system (e.g., employing suitable caching technique(s)). In a second example, information is stored in the training data store in a sparse representation to facilitate computational speed of the system.

[0021] The parameter store stores at least one of the trainable parameters. In one example, information is stored in the parameter store to facilitate efficient transfer of information within the system (e.g., employing suitable caching technique(s)).

[0022] In accordance with another aspect of the present invention, the sequential conditional generalized iterative scaling of the present invention can be combined with other technique(s) in order to facilitate machine learning. For example, SCGIS can be employed with word clustering, improved iterative scaling and/or smoothing.

[0023] Briefly, the word clustering speedup (which can be applied to problem(s) with many outputs and is not limited to words) works as follows. In both conventional GIS and SCGIS as provided above, there are loops over substantially all outputs, y. Even with certain optimizations that can be applied in practice, the length of these loops is still bounded by, and often can be proportional to, the number of outputs. Word clustering therefore changes from a model of the form P(y|x̄) to modeling P(cluster(y)|x̄)×P(y|x̄, cluster(y)).

[0024] Consider a language model in which y is a word, the number of outputs is the vocabulary size, and x̄ represents the words preceding y. For example, the vocabulary can have 10,000 words. Then for a model P(y|x̄), there are 10,000 outputs. On the other hand, if 100 word clusters are created, each with 100 words per cluster, then for a model P(cluster(y)|x̄) (“cluster model”), there are 100 outputs, and for a model P(y|x̄, cluster(y)) (“word model”) there are also 100 outputs. This means that instead of training one model with a time proportional to 10,000, two models are trained, each with time proportional to 100.

[0025] Thus, in this example, there is a 50 times speedup. In practice, the speedups are not quite so large, but speedups of up to a factor of 35 are possible. Although the model form learned is not exactly the same as the original model, the perplexity of the form using two models is actually marginally lower (better) than the perplexity of the form using a single model, so there does not seem to be any disadvantage to using it.
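The arithmetic behind this example can be sketched as follows; the vocabulary and cluster sizes are the hypothetical ones used above.

    vocabulary_size = 10_000
    num_clusters = 100
    words_per_cluster = vocabulary_size // num_clusters     # 100 words in each cluster

    # Training cost per instance is roughly proportional to the number of outputs looped over.
    single_model_outputs = vocabulary_size                  # P(y | x): 10,000 outputs
    two_model_outputs = num_clusters + words_per_cluster    # cluster model plus word model: 200

    print(single_model_outputs / two_model_outputs)         # 50.0, the quoted speedup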

[0026] According to an aspect of the present invention, a training system employing sequential conditional generalized iterative scaling for models using word clustering is provided. The training system includes an expected value update component, an error calculator, a parameter update component, a variable update component, a sum of lambda variable store, a normalization variable store, a training data store, trainable class parameters and trainable word parameters.

[0027] The trainable class parameters are trained employing sequential conditional generalized iterative scaling as described previously. The trainable word parameters are then likewise trained.

[0028] Yet another aspect of the present invention provides for a machine learning system employing parameters trained using sequential conditional generalized iterative scaling.

[0029] Other aspects of the present invention provide methods for training a machine learning system, a computer readable medium having computer executable components for a system facilitating training of a machine learning system, and a data packet adapted to be transmitted between two or more computer processes comprising a data field comprising a trained parameter for a machine learning system, the trained parameter having been trained based, at least in part, upon sequential conditional generalized iterative scaling.

[0030] To the accomplishment of the foregoing and related ends, certain illustrative aspects of the invention are described herein in connection with the following description and the annexed drawings. These aspects are indicative, however, of but a few of the various ways in which the principles of the invention may be employed and the present invention is intended to include all such aspects and their equivalents. Other advantages and novel features of the invention may become apparent from the following detailed description of the invention when considered in conjunction with the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0031] FIG. 1 is a block diagram of a system for training a machine learning system in accordance with an aspect of the present invention.

[0032] FIG. 2 is an exemplary data structure in accordance with an aspect of the present invention.

[0033] FIG. 3 is an exemplary data structure in accordance with an aspect of the present invention.

[0034] FIG. 4 is a block diagram of a system for training a machine learning system in accordance with an aspect of the present invention.

[0035] FIG. 5 is a block diagram of a machine learning system in accordance with an aspect of the present invention.

[0036] FIG. 6 is a block diagram of a machine learning system employing trained parameters in accordance with an aspect of the present invention.

[0037] FIG. 7 is a flow chart illustrating a method for training a learning system in accordance with an aspect of the present invention.

[0038] FIG. 8 is a flow chart illustrating a method for training a learning system in accordance with an aspect of the present invention.

[0039] FIG. 9 is a flow chart further illustrating the method of FIG. 8.

[0040] FIG. 10 is a flow chart illustrating a method for training a learning system in accordance with an aspect of the present invention.

[0041] FIG. 11 illustrates an example operating environment in which the present invention may function.

DETAILED DESCRIPTION OF THE INVENTION

[0042] The present invention is now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It may be evident, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing the present invention.

[0043] As used in this application, the term “computer component” is intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a computer component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a computer component. One or more computer components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.

[0044] Referring to FIG. 1, a system for training a machine learning system 100 in accordance with an aspect of the present invention is illustrated. The system 100 includes an expected value update component 110, an error calculator 120, a parameter update component 130, a variable update component 140, a sum of lambda variable store 150, and a normalization variable store 160.

[0045] The system 100 can be utilized to train a machine learning system that can be used, for example, for a variety of natural language tasks, including Language Modeling, part-of-speech tagging, prepositional phrase attachment, parsing, word selection for machine translation, and finding sentence boundaries. The system 100 is based on employing sequential conditional generalized iterative scaling (SCGIS) to train the machine learning system.

[0046] As discussed previously, maxent models are conventionally learned using GIS. At each iteration, a step is taken in a direction that increases the likelihood of the training data. The step size is determined to be not too large and not too small: the likelihood of the training data increases at each iteration and eventually converges to the global optimum. Unfortunately, this comes at a price: GIS takes a step size inversely proportional to the maximum number of active constraints. Maxent models are interesting precisely because of their ability to combine many different kinds of information, so this weakness of GIS means that maxent models are slow to learn precisely when they are most useful.

[0047] The system 100 employs SCGIS to train the machine learning system. Thus, rather than learning substantially all trainable parameters of the model simultaneously, the system 100 learns them sequentially: one, then the next, and so on, and then back to the beginning. The system 100 can cache subcomputations, for example, to increase speed of the system 100.

[0048] Conventional GIS-based algorithms employ training data stored as a sparse matrix of feature functions with non-zero values for each instance. In accordance with an aspect of the present invention, the sequential conditional generalized iterative scaling of the present invention employs training data stored as a sparse matrix of instances with non-zero values for each feature function.

[0049] Additionally, in accordance with another aspect of the present invention, SCGIS utilizes $\max_{j,y} f_i(\bar{x}_j, y)$. Conventional GIS utilizes an $f^{\#}$ function defined as $f^{\#} = \max_{j,y} \sum_i f_i(\bar{x}_j, y)$.

[0050] Thus, $f^{\#}$ of conventional GIS is equal to the largest total value of the $f_i$. The $\max_{j,y} f_i(\bar{x}_j, y)$ of SCGIS can thus provide significant speedup over conventional GIS. Further, in many maxent applications, the $f_i$ take on only the values 0 or 1, and thus, typically $\max_{j,y} f_i(\bar{x}_j, y) = 1$. As such, instead of slowing by a factor of $f^{\#}$, there may be no significant slowing at all. Further, with SCGIS, instead of updating all λ's simultaneously as in conventional GIS, each feature function can be looped over and an update for that feature function can be computed, in turn.
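The difference between the two slowing factors can be sketched on a tiny hypothetical feature matrix; the values below are illustrative only.

    # feature_values[(j, y)] lists [f_0, f_1, f_2] at that (instance, output) cell.
    feature_values = {
        (0, "a"): [1, 1, 1],
        (1, "b"): [1, 0, 1],
    }

    # GIS slowing factor: f# is the largest total number of active features in any cell.
    f_sharp = max(sum(cell) for cell in feature_values.values())                            # 3

    # SCGIS slowing factor for each feature i: the largest value f_i takes in any cell.
    per_feature_max = [max(cell[i] for cell in feature_values.values()) for i in range(3)]  # [1, 1, 1]

    print(f_sharp, per_feature_max)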

[0051] The expected value update component 110 can modify an expected value based, at least in part, upon a feature function of an input vector and an output value, a sum of lambda variable and a normalization variable. In one example, the expected value update component 110 modifies the expected value based, at least in part, upon the following equation:

expected value = expected value + f_i(x̄_j, y) e^(s[j,y]) / z[j]  (2)

[0052] where f_i(x̄_j, y) is the feature function, x̄_j is the input vector, y is the output, s[j,y] is the sum of lambda variable, and z[j] is the normalization variable.
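Equation (2) translates directly into code. The sketch below assumes a hypothetical sparse list of the (j, y) cells where feature i is non-zero, together with already-cached s and z values.

    import math

    # (j, y, f_i(x_j, y)) triples for the cells where feature i is non-zero (hypothetical data).
    nonzero_cells_i = [(0, "financial", 1.0), (2, "financial", 1.0)]

    # Cached stores maintained elsewhere: s[j, y] = sum_i lambda_i f_i(x_j, y); z[j] = sum_y e^(s[j, y]).
    s = {(0, "financial"): 0.7, (2, "financial"): -0.1}
    z = {0: 3.2, 2: 2.4}

    expected_value_i = 0.0
    for j, y, f_val in nonzero_cells_i:                 # equation (2)
        expected_value_i += f_val * math.exp(s[(j, y)]) / z[j]
    print(expected_value_i)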

[0053] Turning briefly to FIG. 2, an exemplary data structure 200 in accordance with an aspect of the present invention is illustrated. The data structure 200 includes Y number of rows and I number of columns, where Y is the number of output classes (values for y) and I is the number of training instances. For example, the data structure 200 can be utilized to store information associated with the sum of lambda variable.

[0054] Next, referring briefly to FIG. 3, an exemplary data structure 300 in accordance with an aspect of the present invention is illustrated. The data structure 300 includes I number of elements, where I is the number of training instances. For example, the data structure 300 can be utilized to store information associated with the normalization variable.

[0055] The data structures 200, 300 are merely exemplary and it is to be appreciated that numerous other structures are contemplated that provide for organizing and/or storing a plurality of data types conducive to facilitating the training of machine learning system(s) in connection with the subject invention. Any such data structure suitable for employment in connection with the present invention is intended to fall within the scope of the appended claims. Such data structures can be stored in computer readable media including, but not limited to, memories, disks and carrier waves.

[0056] Turning back to FIG. 1, the error calculator 120 can calculate an error based, at least in part, upon the expected value and an observed value. In one example, the error is based, at least in part, upon the following equation:

$\delta_i = \frac{1}{\max_{j,y} f_i(\bar{x}_j, y)} \log\left(\mathrm{observed\ value}[i] / \mathrm{expected\ value}[i]\right)$  (3)

[0057] where f_i(x̄_j, y) is the feature function, x̄_j is the input vector, and y is the output.

[0058] The parameter update component 130 can modify a trainable parameter based, at least in part, upon the error. In one example, modification of the trainable parameter is based, at least in part, upon the following equation:

λ_i = λ_i + δ_i  (4)

[0059] where λ_i is the trainable parameter and δ_i is the error. Thus, in accordance with an aspect of the present invention, each λ_i is updated immediately after expected[i] is computed, rather than after expected values for all features have been computed as done with conventional GIS.
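Equations (3) and (4) amount to a two-line computation; the observed and expected counts below are hypothetical numbers for a single binary feature.

    import math

    observed_i, expected_i = 12.0, 7.5     # hypothetical counts for feature i
    max_f_i = 1.0                          # max over (j, y) of f_i(x_j, y); 1 for a binary feature
    lambda_i = 0.3

    delta_i = (1.0 / max_f_i) * math.log(observed_i / expected_i)   # equation (3)
    lambda_i = lambda_i + delta_i                                   # equation (4)
    print(delta_i, lambda_i)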

[0060] The variable update component 140 can update the sum of lambda variable and/or the normalization variable based, at least in part, upon the error. In one example, updating of the sum of lambda variable and the normalization variable is based, at least in part, upon the following equations:

z[j] = z[j] − e^(s[j,y])  (5)

s[j,y] = s[j,y] + δ_i  (6)

z[j] = z[j] + e^(s[j,y])  (7)

[0061] where s[j,y] is the sum of lambda variable, z[j] is the normalization variable, and δ_i is the error. Thus, rather than recomputing for each instance j and each output y

$s[j,y] = \sum_i \lambda_i \times f_i(\bar{x}_j, y)$

[0062] and the corresponding normalizing factors $z[j] = \sum_y e^{s[j,y]}$,

[0063] these arrays can be computed and stored as invariants, and incrementally updated whenever a λ_i changes. This can lead to a substantial speed up of the system 100.
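The incremental bookkeeping of equations (5) through (7) can be sketched as follows for a toy one-instance, two-output problem with binary features; the final assertions confirm that the patched caches match a from-scratch recomputation. All values are hypothetical.

    import math

    lambdas = {"f0": 0.2, "f1": -0.4}
    # f[y] maps feature name to its (binary) value at instance 0 and output y.
    f = {"yes": {"f0": 1.0}, "no": {"f1": 1.0}}

    # Invariants: s[y] = sum_i lambda_i * f_i(x_0, y); z = sum_y e^(s[y]).
    s = {y: sum(lambdas[name] * val for name, val in feats.items()) for y, feats in f.items()}
    z = sum(math.exp(v) for v in s.values())

    # Apply an update delta to lambda for feature "f0" and patch the caches incrementally.
    delta, i = 0.5, "f0"
    lambdas[i] += delta
    for y, feats in f.items():
        if i in feats:                     # only cells where f_i is non-zero
            z -= math.exp(s[y])            # equation (5)
            s[y] += delta                  # equation (6); binary feature, so delta * f_i = delta
            z += math.exp(s[y])            # equation (7)

    # Recompute from scratch and verify that the cached invariants still hold.
    s_check = {y: sum(lambdas[name] * val for name, val in feats.items()) for y, feats in f.items()}
    assert all(abs(s[y] - s_check[y]) < 1e-12 for y in s_check)
    assert abs(z - sum(math.exp(v) for v in s_check.values())) < 1e-12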

[0064] The system 100 can sequentially update trainable parameters (λ_i), for example, for each feature function. Pseudo-code for implementing SCGIS in accordance with an aspect of the present invention follows:

TABLE 1

    z[1..I] := Y
    s[1..I, 1..Y] := 0
    for each iteration
        for each feature function f_i
            expected value[i] := 0
            for each output y
                for each instance j such that f_i(x̄_j, y) ≠ 0
                    expected value[i] += f_i(x̄_j, y) × e^(s[j,y]) / z[j]
            δ_i := (1 / max_{j,y} f_i(x̄_j, y)) × log(observed value[i] / expected value[i])
            λ_i := λ_i + δ_i
            for each output y
                for each instance j such that f_i(x̄_j, y) ≠ 0
                    z[j] −= e^(s[j,y])
                    s[j,y] += δ_i
                    z[j] += e^(s[j,y])

[0065] where s[ ] is the sum of lambda variable, z[ ] is the normalization variable, Y is the number of output classes (values for y), I is the number of training instances, and i indexes the feature functions.
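The pseudo-code of Table 1 can be collapsed into a short runnable Python sketch. This is an illustrative reconstruction rather than a definitive implementation: it assumes binary feature values (so that max_{j,y} f_i(x̄_j, y) = 1), uses plain dictionaries in place of the sparse stores described above, and runs on a tiny hand-made word sense data set.

    import math

    def train_scgis(instances, outputs, features, iterations=25):
        """instances: list of (x, true_y) pairs; features: list of f_i(x, y) returning 0 or 1."""
        F, I, Y = len(features), len(instances), len(outputs)
        lam = [0.0] * F
        observed = [sum(f(x, true_y) for x, true_y in instances) for f in features]
        # Sparse-by-feature view: for each feature, the (j, y) cells where it is non-zero.
        nonzero = [[(j, y) for j, (x, _) in enumerate(instances) for y in outputs if f(x, y)]
                   for f in features]
        s = {(j, y): 0.0 for j in range(I) for y in outputs}   # sum of lambda store
        z = [float(Y)] * I                                     # normalization store: sum_y e^0 = Y

        for _ in range(iterations):
            for i in range(F):
                expected = sum(math.exp(s[(j, y)]) / z[j] for j, y in nonzero[i])   # eq. (2)
                if expected == 0.0 or observed[i] == 0:
                    continue                                   # skip degenerate features
                delta = math.log(observed[i] / expected)       # eq. (3), with max f_i = 1
                lam[i] += delta                                # eq. (4)
                for j, y in nonzero[i]:
                    z[j] -= math.exp(s[(j, y)])                # eq. (5)
                    s[(j, y)] += delta                         # eq. (6)
                    z[j] += math.exp(s[(j, y)])                # eq. (7)
        return lam

    data = [({"money", "bank"}, "financial"), ({"river", "bank"}, "river"),
            ({"money", "loan"}, "financial")]
    feats = [lambda x, y: 1 if ("money" in x and y == "financial") else 0,
             lambda x, y: 1 if ("river" in x and y == "river") else 0]
    print(train_scgis(data, ["financial", "river"], feats))

Because each toy feature perfectly predicts its sense, the unsmoothed weights grow slowly without bound as iterations are added; the Gaussian-prior smoothing discussed later keeps such weights finite.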

[0066] In order to demonstrate the advantages of SCGIS over conventional GIS, the time for each algorithm to execute one iteration can be compared. Pseudo-code for conventional GIS follows:

TABLE 2

    for each iteration
        expected value[1..F] := 0
        for each training instance j
            for each output y
                s[j,y] := 0
                for each i such that f_i(x̄_j, y) ≠ 0
                    s[j,y] += λ_i × f_i(x̄_j, y)
            z := Σ_y e^(s[j,y])
            for each output y
                for each i such that f_i(x̄_j, y) ≠ 0
                    expected value[i] += f_i(x̄_j, y) × e^(s[j,y]) / z
        for each i
            δ_i := (1 / f^#) × log(observed value[i] / expected value[i])
            λ_i += δ_i
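For comparison, the pseudo-code of Table 2 admits a similar runnable sketch; again this is an illustrative reconstruction that assumes binary features and recomputes s and z from scratch each iteration, which is exactly the work SCGIS avoids. It can be run on the same toy data and features as the SCGIS sketch above.

    import math

    def train_gis(instances, outputs, features, iterations=25):
        """One GIS run over (x, true_y) pairs with binary features f_i(x, y)."""
        F = len(features)
        lam = [0.0] * F
        observed = [sum(f(x, true_y) for x, true_y in instances) for f in features]
        # f# = largest number of simultaneously active features over all (instance, output)
        # cells; at least one feature is assumed active somewhere.
        f_sharp = max(sum(f(x, y) for f in features) for x, _ in instances for y in outputs)

        for _ in range(iterations):
            expected = [0.0] * F
            for x, _ in instances:
                s = {y: sum(lam[i] * features[i](x, y) for i in range(F)) for y in outputs}
                z = sum(math.exp(v) for v in s.values())
                for y in outputs:
                    for i in range(F):
                        expected[i] += features[i](x, y) * math.exp(s[y]) / z
            for i in range(F):
                if expected[i] and observed[i]:
                    lam[i] += math.log(observed[i] / expected[i]) / f_sharp   # step slowed by f#
        return lam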

[0067] Assume that for every instance and output there is at least one non-zero indicator function, which is generally true in practice. Referring to Table 2, for GIS the top loops end up iterating over all non-zero indicator functions (e.g., feature functions), for each output, for each training instance. In other words, they examine every entry in the training matrix T once, and thus require time proportional to the size of T. The bottom loops simply require time F, the number of indicator functions, which is smaller than T. Thus, GIS requires time O(T).

[0068] Referring to Table 1, for SCGIS, the top loops are also over each non-zero entry in the training data, which take time O(T). The bottom loops also require time O(T). Thus, one iteration of SCGIS takes about as long as one iteration of GIS. However, the speed up of SCGIS comes from the step size: the update in GIS is slowed by $f^{\#}$, while the update in SCGIS is not. Thus, SCGIS can converge by up to a factor of $f^{\#}$ faster than GIS. For many applications, $f^{\#}$ can be large.

[0069] The sum of lambda variable store 150 can store the sum of lambda variables. The normalization variable store 160 can store the normalization variables.

[0070] Those skilled in the art will appreciate that the sequential conditional generalized iterative scaling of the present invention can be employed with maximum entropy models, logistic regression and/or perceptrons trained to minimize entropy.

[0071] While FIG. 1 is a block diagram illustrating components for the system 100, it is to be appreciated that the expected value update component 110, the error calculator 120, the parameter update component 130 and/or the variable update component 140 can be implemented as one or more computer components, as that term is defined herein. Thus, it is to be appreciated that computer executable components operable to implement the system 100, the expected value update component 110, the error calculator 120, the parameter update component 130 and/or the variable update component 140 can be stored on computer readable media including, but not limited to, an ASIC (application specific integrated circuit), CD (compact disc), DVD (digital video disk), ROM (read only memory), floppy disk, hard disk, EEPROM (electrically erasable programmable read only memory) and memory stick in accordance with the present invention.

[0072] Turning to FIG. 4, a system for training a machine learning system 400 in accordance with an aspect of the present invention is illustrated. The system 400 includes an expected value update component 110, an error calculator 120, a parameter update component 130, a variable update component 140, a sum of lambda variable store 150, a normalization variable store 160, a training data store 170 and a parameter store 180.

[0073] The system 400 can be utilized to train a machine learning system that can be used, for example, for a variety of natural language tasks, including Language Modeling, part-of-speech tagging, prepositional phrase attachment, parsing, word selection for machine translation, and finding sentence boundaries. The system 400 is based on employing sequential conditional generalized iterative scaling to train the machine learning system as discussed previously.

[0074] The training data store 170 stores input vector(s) and/or the observed value(s). In one example, information is stored in the training data store 170 so as to facilitate efficient transfer of information within the system 400 (e.g., employing suitable caching technique(s)). In a second example, information is stored in the training data store 170 in a sparse representation to facilitate computational speed of the system 400. It is to be appreciated that information can be stored in the training data store 170 in any suitable data structure including, but not limited to, databases, tables, records, arrays and lists.

[0075] The parameter store 180 stores at least one of the trainable parameters. In one example, information is stored in the parameter store 180 so as to facilitate efficient transfer of information within the system 400 (e.g., employing suitable caching technique(s)). It is to be appreciated that information can be stored in the parameter store in any suitable data structure including, but not limited to, databases, tables, records, arrays and lists.

[0076] In accordance with an aspect of the present invention, the sequential conditional generalized iterative scaling of the present invention can be combined with other technique(s) in order to facilitate machine learning. Examples of such techniques include word clustering, improved iterative scaling and/or smoothing.

[0077] Word clustering is discussed in greater detail in copending U.S. patent application entitled METHOD AND APPARATUS FOR FAST MACHINE TRAINING, having Ser. No. 09/489,045, the entirety of which is hereby incorporated by reference. In one example, word clustering can lead to a factor of 35 speedup. Additionally, word clustering can facilitate the use of sequential conditional generalized iterative scaling on system(s) having a large number of outputs.

[0078] Briefly, the word clustering speedup (which can be applied to problem(s) with many outputs and is not limited to words) works as follows. In both conventional generalized iterative scaling and sequential conditional generalized iterative scaling as provided above, there are loops over substantially all outputs, y. Even with certain optimizations that can be applied in practice, the length of these loops is still bounded by, and often can be proportional to, the number of outputs. Word clustering therefore changes from a model of the form P(y|x̄) to modeling P(cluster(y)|x̄)×P(y|x̄, cluster(y)).

[0079] Consider a language model in which y is a word, the number of outputs is the vocabulary size, and x̄ represents the words preceding y. For example, the vocabulary can have 10,000 words. Then for a model P(y|x̄), there are 10,000 outputs. On the other hand, if 100 word clusters are created, each with 100 words per cluster, then for a model P(cluster(y)|x̄) (“cluster model”), there are 100 outputs, and for a model P(y|x̄, cluster(y)) (“word model”) there are also 100 outputs. This means that instead of training one model with a time proportional to 10,000, two models are trained, each with time proportional to 100.

[0080] Thus, in this example, there is a 50 times speedup. In practice, the speedups are not quite so large, but speedups of up to a factor of 35 can be achieved. Although the model form learned is not exactly the same as the original model, the perplexity of the form using two models is actually marginally lower (better) than the perplexity of the form using a single model, so there does not seem to be any disadvantage to using it.

[0081] Additionally, SCGIS can be combined with improved iterative scaling (IIS) to achieve significant speedups. With IIS, instead of treating $f^{\#}$ as a constant, it can be treated as a function of $\bar{x}_j$ and y. In particular, let

$f^{\#}(\bar{x}, y) = \sum_i f_i(\bar{x}, y)$.

[0082] Then, the following equation can be solved numerically for δ_i:

$\mathrm{observed\ value}[i] = \sum_{j,y} P_{\bar{\lambda}}(y \mid \bar{x}_j) \times f_i(\bar{x}_j, y) \times \exp\bigl(\delta_i f^{\#}(\bar{x}_j, y)\bigr)$  (8)

[0083] Notice that in the special case where $f^{\#}(\bar{x}, y)$ is a constant $f^{\#}$, equation (8) reduces to:

δ_i = log(observed value[i] / expected value[i]) / f^#  (9)

[0084] which is used to update the trainable parameters (λ_i's) in conventional GIS. However, for training instances where $f^{\#}(\bar{x}, y)$ is less than $f^{\#}$, the IIS update can take a proportionately larger step. Thus, IIS can lead to speedups when $f^{\#}(\bar{x}, y)$ is substantially less than $f^{\#}$.

[0085] IIS can be combined with SCGIS by using an update rule (e.g., for use by the error calculator 120) where one solves the following equation for δ_i:

$\mathrm{observed\ value}[i] = \sum_{j,y} P_{\bar{\lambda}}(y \mid \bar{x}_j) \times f_i(\bar{x}_j, y) \times \exp\bigl(\delta_i f_i(\bar{x}_j, y)\bigr)$  (10)

[0086] For many model types, the $f_i$ take on only the values 1 or 0. In this case, equation (10) reduces to the normal SCGIS update.
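When the f_i are not restricted to 0 and 1, equation (10) has no closed-form solution, but its right-hand side is increasing in δ_i, so a one-dimensional Newton search suffices. The sketch below is a generic solver under that assumption; the cell probabilities and feature values passed in are hypothetical.

    import math

    def solve_scgis_iis_update(observed_i, cells, tol=1e-10, max_iter=100):
        """Solve equation (10) for delta_i.

        cells: non-empty list of (p, f) pairs, one per (j, y) with f_i(x_j, y) != 0, where p
        is the current model probability P(y | x_j) and f is f_i(x_j, y).
        """
        delta = 0.0
        for _ in range(max_iter):
            g = sum(p * f * math.exp(delta * f) for p, f in cells) - observed_i
            g_prime = sum(p * f * f * math.exp(delta * f) for p, f in cells)
            step = g / g_prime
            delta -= step                   # Newton step on the monotone residual g(delta)
            if abs(step) < tol:
                break
        return delta

    # With all f = 1 the solution collapses to log(observed / expected), matching equation (3).
    print(solve_scgis_iis_update(2.0, [(0.4, 1.0), (0.3, 1.0), (0.5, 1.0)]))
    print(math.log(2.0 / (0.4 + 0.3 + 0.5)))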

[0087] Next, SCGIS can be combined with smoothing, which can lead to significant speedups. Maximum entropy models are naturally maximally smooth, in the sense that they are as close as possible to uniform, subject to satisfying the constraints. However, in practice, there may be enough constraints that the models are not nearly smooth enough, that is, they overfit the training data. In order to smooth models, a Gaussian prior on the parameters can be assumed. The models no longer satisfy the constraints exactly, but work much better on test data. In particular, instead of attempting to maximize the probability of the training data, they maximize a slightly different objective function, the probability of the training data times the prior probability of the model:

$\arg\max_{\bar{\lambda}} \prod_{j=1}^{J} P_{\bar{\lambda}}(y_j \mid \bar{x}_j) \, P(\bar{\lambda})$  (11)

[0088] where $P(\bar{\lambda}) = \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\lambda_i^2 / 2\sigma^2}$.

[0089] In other words, the probability of the λ's is a simple normal distribution with 0 mean and a standard deviation of σ. This leads to a modified GIS update rule in which, to find the updates, one solves for δ_i in:

$\mathrm{observed\ value}[i] = \mathrm{expected\ value}[i] \times e^{\delta_i f^{\#}} + \frac{\lambda_i + \delta_i}{\sigma^2}$  (12)

[0090] SCGIS can be modified in a similar way to use an update rule in which one solves for δ_i in:

$\mathrm{observed\ value}[i] = \mathrm{expected\ value}[i] \times e^{\delta_i \max_{j,y} f_i(\bar{x}_j, y)} + \frac{\lambda_i + \delta_i}{\sigma^2}$  (13)
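Equation (13) likewise has no closed-form solution. A sketch of a Newton solve for δ_i follows, assuming binary features (so that max_{j,y} f_i(x̄_j, y) = 1) and hypothetical counts; with a very large σ the prior term vanishes and the step approaches the unsmoothed log(observed/expected).

    import math

    def solve_smoothed_scgis_update(observed_i, expected_i, lambda_i, sigma, tol=1e-10, max_iter=100):
        """Solve equation (13) for delta_i, assuming max over (j, y) of f_i(x_j, y) is 1."""
        delta = 0.0
        for _ in range(max_iter):
            g = expected_i * math.exp(delta) + (lambda_i + delta) / sigma**2 - observed_i
            g_prime = expected_i * math.exp(delta) + 1.0 / sigma**2
            step = g / g_prime
            delta -= step                   # Newton step; g is strictly increasing in delta
            if abs(step) < tol:
                break
        return delta

    print(solve_smoothed_scgis_update(2.0, 1.2, 0.0, sigma=1e6))   # ~0.51, nearly unsmoothed
    print(solve_smoothed_scgis_update(2.0, 1.2, 0.0, sigma=1.0))   # smaller step: the prior shrinks it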

[0091] In one example, SCGIS is combined with IIS and smoothing to achieve system speedups. Referring to FIG. 5, a training system 500 employing sequential conditional generalized iterative scaling and word clustering is illustrated. The system 500 includes an expected value update component 110, an error calculator 120, a parameter update component 130, a variable update component 140, a sum of lambda variable store 150, a normalization variable store 160, a training data store 170, trainable class parameters 184 and trainable word parameters 188.

[0092] The system 500 can be utilized to train a machine learning system that can be used, for example, for a variety of natural language tasks, including Language Modeling, part-of-speech tagging, prepositional phrase attachment, parsing, word selection for machine translation, and finding sentence boundaries. The system 500 is based on sequential conditional generalized iterative scaling employing word clustering to train the machine learning system as discussed previously.

[0093] In one example, the output values y in the training data are divided into classes. Thereafter, the trainable class parameters 184 are trained employing sequential conditional generalized iterative scaling as described previously. The trainable word parameters 188 are then likewise trained.

[0094] It is to be appreciated that the word clustering technique can be extended to multiple level(s). For example, by putting words into superclusters, such as their part of speech, and clusters, such as semantically similar words of a given part of speech, a three level model can be employed. In another example, the technique can be extended to up to log₂ Y levels with two outputs per level, meaning that the space requirements are proportional to 2 instead of to the original Y.

[0095] Referring next to FIG. 6, a system 600 having a machine learning system employing trained parameters in accordance with an aspect of the present invention is illustrated. The system 600 includes an input component 610, a machine learning system 620 and a trained parameter store 630.

[0096] The input component 610 provides an input vector to the machine learning system 620. For example, the input vector can be based, at least in part, upon information received from a keyboard, a mouse, a speech recognition system, a tablet, a pen device, a photocopier, a document scanner, an optical character recognition system, a personal digital assistant, a fax machine and/or a tablet personal computer.

[0097] The machine learning system 620 receives the input vector from the input component 610 and, based, at least in part, upon trained parameter(s) stored in the trained parameter store 630, provides an output.

[0098] The trained parameter store 630 stores trained parameter(s) trained based, at least in part, upon sequential conditional generalized iterative scaling as described above.

[0099] In view of the exemplary systems shown and described above, methodologies that may be implemented in accordance with the present invention will be better appreciated with reference to the flow chart of FIGS. 7, 8, 9 and 10. While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of blocks, it is to be understood and appreciated that the present invention is not limited by the order of the blocks, as some blocks may, in accordance with the present invention, occur in different orders and/or concurrently with other blocks from that shown and described herein. Moreover, not all illustrated blocks may be required to implement the methodologies in accordance with the present invention.

[0100] The invention may be described in the general context of computer-executable instructions, such as program modules, executed by one or more components. Generally, program modules include routines, programs, objects, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically the functionality of the program modules may be combined or distributed as desired in various embodiments.

[0101] Turning to FIG. 7, a method 700 for training a machine learning system in accordance with an aspect of the present invention is illustrated. At 710, an expected value is updated based, at least in part, upon a feature function of an input vector and an output value, a sum of lambda variable and a normalization variable (e.g., based on equation (2)). At 720, an error is calculated based, at least in part, upon the expected value and an observed value (e.g., based on equation (3)). At 730, a trainable parameter is modified based, at least in part, upon the error (e.g., based on equation (4)). At 740, the sum of lambda variable and/or the normalization variable is updated based, at least in part, upon the error (e.g., based on equations (5), (6) and/or (7)). In one example, 710, 720, 730 and 740 are performed for each feature function in order to perform an iteration of SCGIS.

[0102] Referring next to FIGS. 8 and 9, a method 800 for training a machine learning system in accordance with an aspect of the present invention is illustrated (e.g., based, at least in part, upon the pseudo-code of Table 1). At 810, initialization is performed. For example, a set of sum of lambda variables and/or a set of normalization variables can be initialized. At 820, an expected value is initialized. Next, at 830, for each output, for each instance that the feature function is not zero, the expected value is updated based, at least in part, upon a feature function of an input vector and an output value, a sum of lambda variable and a normalization variable (e.g., based on equation (2)). At 840, a determination is made as to whether there are more outputs. If the determination at 840 is YES, processing continues at 830. If the determination at 840 is NO, at 850, an error is calculated based, at least in part, upon the expected value and an observed value (e.g., based on equation (3)). At 860, a trainable parameter is modified based, at least in part, upon the error (e.g., based on equation (4)).

[0103] At 870, for each output, for each instance that the feature function is not zero, at least one of the sum of lambda variable and the normalization variable is updated based, at least in part, upon the error (e.g., based on equations (5), (6) and/or (7)). At 880, a determination is made as to whether there are more outputs. If the determination at 880 is YES, processing continues at 870. If the determination at 880 is NO, at 890, a determination is made as to whether there are more feature functions. If the determination at 890 is YES, processing continues at 820. If the determination at 890 is NO, no further processing occurs.

[0104] In one example, 820 through 890 are performed for each feature function in order to perform an iteration of SCGIS.

[0105] Turning to FIG. 10, a method 1000 for training a machine learning system in accordance with an aspect of the present invention is illustrated. At 1010, trainable class parameters are trained based, at least in part, upon sequential conditional generalized iterative scaling. At 1020, trainable word parameters are trained based, at least in part, upon sequential conditional generalized iterative scaling.

[0106] In order to provide additional context for various aspects of the present invention, FIG. 11 and the following discussion are intended to provide a brief, general description of a suitable operating environment 1110 in which various aspects of the present invention may be implemented. While the invention is described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices, those skilled in the art will recognize that the invention can also be implemented in combination with other program modules and/or as a combination of hardware and software. Generally, however, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular data types. The operating environment 1110 is only one example of a suitable operating environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Other well known computer systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include the above systems or devices, and the like.

[0107] With reference to FIG. 11, an exemplary environment 1110 for implementing various aspects of the invention includes a computer 1112. The computer 1112 includes a processing unit 1114, a system memory 1116, and a system bus 1118. The system bus 1118 couples system components including, but not limited to, the system memory 1116 to the processing unit 1114. The processing unit 1114 can be any of various available processors. Dual microprocessors and other multiprocessor architectures also can be employed as the processing unit 1114.

[0108] The system bus 1118 can be any of several types of bus structure(s) including the memory bus or memory controller, a peripheral bus or external bus, and/or a local bus using any variety of available bus architectures including, but not limited to, 11-bit bus, Industrial Standard Architecture (ISA), Micro-Channel Architecture (MSA), Extended ISA (EISA), Intelligent Drive Electronics (IDE), VESA Local Bus (VLB), Peripheral Component Interconnect (PCI), Universal Serial Bus (USB), Advanced Graphics Port (AGP), Personal Computer Memory Card International Association bus (PCMCIA), and Small Computer Systems Interface (SCSI).

[0109] The system memory 1116 includes volatile memory 1120 and nonvolatile memory 1122. The basic input/output system (BIOS), containing the basic routines to transfer information between elements within the computer 1112, such as during start-up, is stored in nonvolatile memory 1122. By way of illustration, and not limitation, nonvolatile memory 1122 can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory 1120 includes random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM).

[0110] Computer 1112 also includes removable/nonremovable, volatile/nonvolatile computer storage media. FIG. 11 illustrates, for example, a disk storage 1124. Disk storage 1124 includes, but is not limited to, devices like a magnetic disk drive, floppy disk drive, tape drive, Jaz drive, Zip drive, LS-100 drive, flash memory card, or memory stick. In addition, disk storage 1124 can include storage media separately or in combination with other storage media including, but not limited to, an optical disk drive such as a compact disk ROM device (CD-ROM), CD recordable drive (CD-R Drive), CD rewritable drive (CD-RW Drive) or a digital versatile disk ROM drive (DVD-ROM). To facilitate connection of the disk storage devices 1124 to the system bus 1118, a removable or non-removable interface is typically used, such as interface 1126.

[0111] It is to be appreciated that FIG. 11 describes software that acts as an intermediary between users and the basic computer resources described in suitable operating environment 1110. Such software includes an operating system 1128. Operating system 1128, which can be stored on disk storage 1124, acts to control and allocate resources of the computer system 1112. System applications 1130 take advantage of the management of resources by operating system 1128 through program modules 1132 and program data 1134 stored either in system memory 1116 or on disk storage 1124. It is to be appreciated that the present invention can be implemented with various operating systems or combinations of operating systems.

[0112] A user enters commands or information into the computer 1112 through input device(s) 1136. Input devices 1136 include, but are not limited to, a pointing device such as a mouse, trackball, stylus, touch pad, keyboard, microphone, joystick, game pad, satellite dish, scanner, TV tuner card, digital camera, digital video camera, web camera, and the like. These and other input devices connect to the processing unit 1114 through the system bus 1118 via interface port(s) 1138. Interface port(s) 1138 include, for example, a serial port, a parallel port, a game port, and a universal serial bus (USB). Output device(s) 1140 use some of the same type of ports as input device(s) 1136. Thus, for example, a USB port may be used to provide input to computer 1112, and to output information from computer 1112 to an output device 1140. Output adapter 1142 is provided to illustrate that there are some output devices 1140 like monitors, speakers, and printers among other output devices 1140 that require special adapters. The output adapters 1142 include, by way of illustration and not limitation, video and sound cards that provide a means of connection between the output device 1140 and the system bus 1118. It should be noted that other devices and/or systems of devices provide both input and output capabilities such as remote computer(s) 1144.

[0113] Computer 1112 can operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) 1144. The remote computer(s) 1144 can be a personal computer, a server, a router, a network PC, a workstation, a microprocessor based appliance, a peer device or other common network node and the like, and typically includes many or all of the elements described relative to computer 1112. For purposes of brevity, only a memory storage device 1146 is illustrated with remote computer(s) 1144. Remote computer(s) 1144 is logically connected to computer 1112 through a network interface 1148 and then physically connected via communication connection 1150. Network interface 1148 encompasses communication networks such as local-area networks (LAN) and wide-area networks (WAN). LAN technologies include Fiber Distributed Data Interface (FDDI), Copper Distributed Data Interface (CDDI), Ethernet/IEEE 802.3, Token Ring/IEEE 802.5 and the like. WAN technologies include, but are not limited to, point-to-point links, circuit switching networks like Integrated Services Digital Networks (ISDN) and variations thereon, packet switching networks, and Digital Subscriber Lines (DSL).

[0114] Communication connection(s) 1150 refers to the hardware/software employed to connect the network interface 1148 to the bus 1118. While communication connection 1150 is shown for illustrative clarity inside computer 1112, it can also be external to computer 1112. The hardware/software necessary for connection to the network interface 1148 includes, for exemplary purposes only, internal and external technologies such as modems, including regular telephone grade modems, cable modems and DSL modems, ISDN adapters, and Ethernet cards.

[0115] What has been described above includes examples of the present invention. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the present invention, but one of ordinary skill in the art may recognize that many further combinations and permutations of the present invention are possible. Accordingly, the present invention is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims. Furthermore, to the extent that the term “includes” is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term “comprising” as “comprising” is interpreted when employed as a transitional word in a claim.

What is claimed is:
1. A system for training a machine learning system, comprising: an expected value update component that, for a plurality of outputs and for a plurality of instances in which a single feature function is non-zero, modifies an expected value based, at least in part, upon the single feature function of an input vector and an output value, a sum of lambda variable and a normalization variable; an error calculator that calculates an error based, at least in part, upon the expected value and an observed value; a parameter update component that modifies a trainable parameter based, at least in part, upon the error; and, a variable update component that, for the plurality of outputs and for the plurality of instances in which the feature function is non-zero, sequentially updates at least one of the sum of lambda variable and the normalization variable based, at least in part, upon the error.
2. The system of claim 1, the error calculation further employing, at least in part, the following equation:

$\mathrm{observed\ value}[i] = \mathrm{expected\ value}[i] \times e^{\delta_i \max_{j,y} f_i(\bar{x}_j, y)} + \frac{\lambda_i + \delta_i}{\sigma^2}$

where λ_i is the trainable parameter, δ_i is the error, σ is a standard deviation, f_i(x̄_j, y) is the feature function, x̄_j is the input vector, and y is the output.

3. The system of claim 1, the error calculation further employing, at least in part, the following equation:

$\mathrm{observed\ value}[i] = \sum_{j,y} P_{\bar{\lambda}}(y \mid \bar{x}_j) \times f_i(\bar{x}_j, y) \times \exp\bigl(\delta_i f_i(\bar{x}_j, y)\bigr)$

where λ̄ is a trainable parameter vector, δ_i is the error, f_i(x̄_j, y) is the feature function, x̄_j is the input vector, and y is the output.
4. The system of claim 1, modification of the expected value being based, at least in part, upon the following equation: expected value = expected value + f_i(x̄_j, y) e^(s[j,y]) / z[j], where f_i(x̄_j, y) is the feature function, x̄_j is the input vector, y is the output, s[j,y] is the sum of lambda variable, and z[j] is the normalization variable.

5. The system of claim 1, the error being based, at least in part, upon the following equation:

$\delta_i = \frac{1}{\max_{j,y} f_i(\bar{x}_j, y)} \log\left(\mathrm{observed\ value}[i] / \mathrm{expected\ value}[i]\right)$

where f_i(x̄_j, y) is the feature function, x̄_j is the input vector, and y is the output.

6. The system of claim 1, modification of the trainable parameter being based, at least in part, upon the following equation: λ_i = λ_i + δ_i, where λ_i is the trainable parameter and δ_i is the error.

7. The system of claim 1, updating of the sum of lambda variable and the normalization variable being based, at least in part, upon the following equations: z[j] = z[j] − e^(s[j,y]); s[j,y] = s[j,y] + δ_i; z[j] = z[j] + e^(s[j,y]); where s[j,y] is the sum of lambda variable, z[j] is the normalization variable, and δ_i is the error.

8. The system of claim 1, further comprising a training data store that stores at least one of the observed value and the input vector.

9. The system of claim 8, at least one of the observed value and the input vector being stored in a sparse representation.

10. The system of claim 1, further comprising a parameter store that stores at least one trainable parameter.
11. A machine learning system trained using the system of claim 1.

12. A system for training a machine learning system, comprising: an expected value update component that, for a plurality of outputs and for a plurality of instances in which a single feature function is non-zero, modifies an expected value based, at least in part, upon the single feature function of an input vector and an output value, a sum of lambda variable and a normalization variable; an error calculator that calculates an error based, at least in part, upon the expected value and an observed value; a parameter update component that modifies class trainable parameters or word trainable parameters based, at least in part, upon the error; and, a variable update component that, for the plurality of outputs and for the plurality of instances in which the feature function is non-zero, sequentially updates at least one of the sum of lambda variable and the normalization variable based, at least in part, upon the error.

13. The system of claim 12, the class trainable parameters being trained before the word trainable parameters are trained.

14. The system of claim 12, further comprising a training data store that stores at least one of the observed value and the input vector.
15. A method for training a machine learning system, comprising: for each feature function, updating an expected value based, at least in part, upon a feature function of an input vector and an output value, a sum of lambda variable and a normalization variable; for each feature function, calculating an error based, at least in part, upon the expected value and an observed value; for each feature function, modifying a trainable parameter based, at least in part, upon the error; and, for each feature function, updating at least one of the sum of lambda variable and the normalization variable based, at least in part, upon the error.

16. The method of claim 15, further comprising at least one of word clustering, smoothing and improved iterative scaling.

17. A method for training a machine learning system, comprising: updating an expected value based, at least in part, upon a feature function of an input vector and an output value, a sum of lambda variable and a normalization variable, for each output, for each instance that the feature function is not zero; calculating an error based, at least in part, upon the expected value and an observed value; modifying a trainable parameter based, at least in part, upon the error; and, updating at least one of the sum of lambda variable and the normalization variable based, at least in part, upon the error, for each output, for each instance that the feature function is not zero.

18. The method of claim 17, further comprising at least one of the following acts: performing general initialization; resetting an expected value; determining whether there are more outputs; and, determining whether there are more feature functions.

19. A method for training a learning system, comprising: training trainable class parameters based, at least in part, upon sequential conditional generalized iterative scaling; and, training trainable word parameters based, at least in part, upon sequential conditional generalized iterative scaling.

20. A data packet transmitted between two or more computer components that facilitates training a machine learning system, the data packet comprising: a data field comprising a trained parameter for a machine learning system, the trained parameter having been trained based, at least in part, upon sequential conditional generalized iterative scaling.

21. A computer readable medium storing computer executable components of a system facilitating training of a machine learning system, comprising: an expected value update component that modifies an expected value for a plurality of outputs and for a plurality of instances in which a single feature function is non-zero based, at least in part, upon the single feature function of an input vector and an output value, a sum of lambda variable and a normalization variable; an error calculator component that calculates an error based, at least in part, upon the expected value and an observed value; a parameter update component that modifies a trainable parameter based, at least in part, upon the error; and, a variable update component that sequentially updates at least one of the sum of lambda variable and the normalization variable for the plurality of outputs and for the plurality of instances in which the feature function is non-zero based, at least in part, upon the error.

22. A training system for a machine learning system, comprising: means for modifying an expected value for a plurality of outputs and for a plurality of instances in which a feature function is non-zero based, at least in part, upon the feature function of an input vector and an output value, a sum of lambda variable and a normalization variable; means for calculating an error based, at least in part, upon the expected value and an observed value; means for modifying a trainable parameter based, at least in part, upon the error; and, means for updating at least one of the sum of lambda variable and the normalization variable for the plurality of outputs and for the plurality of instances in which the feature function is non-zero based, at least in part, upon the error.